Introduction to TidyR

Gathering columns into key-value pairs

The most important function in tidyr is gather(). It should be used when you have columns that are not variables and you want to collapse them into key-value pairs.

The easiest way to visualize the effect of gather() is that it makes wide datasets long. As you saw, running the following command on wide_df will make it long:

gather(wide_df, my_key, my_val, -col)

# Read BMI data
library(readr)
bmi <- read_csv("../xDatasets/bmi_clean.csv")

# Apply gather() to bmi and save the result as bmi_long
library(tidyr)
bmi_long <- gather(bmi, year, bmi_val, -Country)

# View the first 20 rows of the result
bmi_long %>%
  head(20) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
Country year bmi_val
Afghanistan Y1980 21.48678
Albania Y1980 25.22533
Algeria Y1980 22.25703
Andorra Y1980 25.66652
Angola Y1980 20.94876
Antigua and Barbuda Y1980 23.31424
Argentina Y1980 25.37913
Armenia Y1980 23.82469
Australia Y1980 24.92729
Austria Y1980 24.84097
Azerbaijan Y1980 24.49375
Bahamas Y1980 24.21064
Bahrain Y1980 23.97588
Bangladesh Y1980 20.51918
Barbados Y1980 24.36372
Belarus Y1980 24.90898
Belgium Y1980 25.09879
Belize Y1980 24.54345
Benin Y1980 20.80754
Bermuda Y1980 25.07881

Notice how now, instead of being represented in the column names, years are now all neatly represented in the year column. Try checking dim(bmi_long) and dim(bmi) before moving on.

Spreading key-value pairs into columns

The opposite of gather() is spread(), which takes key-values pairs and spreads them across multiple columns. This is useful when values in a column should actually be column names (i.e. variables). It can also make data more compact and easier to read.

The easiest way to visualize the effect of spread() is that it makes long datasets wide. As you saw, running the following command will make long_df wide:

spread(long_df, my_key, my_val)

# Apply spread() to bmi_long
bmi_wide <- spread(bmi_long, year, bmi_val)

# View the head of bmi_wide
bmi_wide %>%
  head(8) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
Country Y1980 Y1981 Y1982 Y1983 Y1984 Y1985 Y1986 Y1987 Y1988 Y1989 Y1990 Y1991 Y1992 Y1993 Y1994 Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008
Afghanistan 21.48678 21.46552 21.45145 21.43822 21.42734 21.41222 21.40132 21.37679 21.34018 21.29845 21.24818 21.20269 21.14238 21.06376 20.97987 20.91132 20.85155 20.81307 20.78591 20.75469 20.69521 20.62643 20.59848 20.58706 20.57759 20.58084 20.58749 20.60246 20.62058
Albania 25.22533 25.23981 25.25636 25.27176 25.27901 25.28669 25.29451 25.30217 25.30450 25.31944 25.32357 25.28452 25.23077 25.21192 25.22115 25.25874 25.31097 25.33988 25.39116 25.46555 25.55835 25.66701 25.77167 25.87274 25.98136 26.08939 26.20867 26.32753 26.44657
Algeria 22.25703 22.34745 22.43647 22.52105 22.60633 22.69501 22.76979 22.84096 22.90644 22.97931 23.04600 23.11333 23.18776 23.25764 23.32273 23.39526 23.46811 23.54160 23.61592 23.69486 23.77659 23.86256 23.95294 24.05243 24.15957 24.27001 24.38270 24.48846 24.59620
Andorra 25.66652 25.70868 25.74681 25.78250 25.81874 25.85236 25.89089 25.93414 25.98477 26.04450 26.10936 26.17912 26.24017 26.30356 26.36793 26.43569 26.50769 26.58255 26.66337 26.75078 26.83179 26.92373 27.02525 27.12481 27.23107 27.32827 27.43588 27.53363 27.63048
Angola 20.94876 20.94371 20.93754 20.93187 20.93569 20.94857 20.96030 20.98025 21.01375 21.05269 21.09007 21.12136 21.14987 21.13938 21.14186 21.16022 21.19076 21.22621 21.27082 21.31954 21.37480 21.43664 21.51765 21.59924 21.69218 21.80564 21.93881 22.08962 22.25083
Antigua and Barbuda 23.31424 23.39054 23.45883 23.53735 23.63584 23.73109 23.83449 23.93649 24.05364 24.16347 24.26782 24.36568 24.45644 24.54096 24.60945 24.66461 24.72544 24.78714 24.84936 24.91721 24.99158 25.05857 25.13039 25.20713 25.29898 25.39965 25.51382 25.64247 25.76602
Argentina 25.37913 25.44951 25.50242 25.55644 25.61271 25.66593 25.72364 25.78529 25.84428 25.88510 25.92482 25.99177 26.07642 26.17288 26.27872 26.37522 26.47182 26.57778 26.68714 26.79005 26.88103 26.96067 26.99882 27.04738 27.11001 27.18941 27.28179 27.38889 27.50170
Armenia 23.82469 23.86401 23.91023 23.95649 24.00181 24.04083 24.08736 24.13334 24.17219 24.19556 24.20618 24.19790 24.12982 24.05854 24.02297 24.01570 24.02627 24.03885 24.07100 24.11699 24.18045 24.26670 24.37698 24.50332 24.64178 24.81447 24.99160 25.17590 25.35542

Separating columns

The separate() function allows you to separate one column into multiple columns. Unless you tell it otherwise, it will attempt to separate on any character that is not a letter or number. You can also specify a specific separator using the sep argument.

We’ve loaded the small dataset called treatments into your workspace. This dataset obeys the principles of tidy data, but we’d like to split the treatment dates into two separate columns: year and month. This can be accomplished with the following:

separate(treatments, year_mo, c("year", "month"))

# Apply separate() to bmi_cc
bmi_cc_clean <- separate(bmi_cc, col = Country_ISO, into = c("Country", "ISO"), sep = "/")

# Print the head of the result
head(bmi_cc_clean)

Uniting columns

The opposite of separate() is unite(), which takes multiple columns and pastes them together. By default, the contents of the columns will be separated by underscores in te new column, but this behavior can be altered via the sep argument.

We’ve loaded the treatments data into your workspace again, but this time the year_mo column has been separated into year and month. The original column can be recreated by putting year and month back together:

unite(treatments, year_mo, year, month)

# Apply unite() to bmi_cc_clean
bmi_cc <- unite(bmi_cc_clean, Country_ISO, Country, ISO, sep = "-")

# View the head of the result
head(bmi_cc)

Column headers are values, not variable names

You saw earlier in the chapter how we sometimes come across datasets where column names are actually values of a variable (e.g. months of the year). This is often the case when working with repeated measures data, where measurements are taken on subjects of interest on multiple occasions over time. The gather() function is helpful in these situations.

# Read Census data
census <- read_csv("../xDatasets/census-retail.csv")

# View the head of census
census %>%
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
1992 146913 147270 146831 148082 149015 149821 150809 151064 152595 153577 153605 155504
1993 157525 156292 154774 158996 160624 160171 162832 162491 163285 164711 166593 168101
1994 167504 169652 172775 173099 172340 174307 174801 177289 178776 180569 180695 181492
1995 182423 179472 180996 181702 183543 186088 185470 186814 187338 186546 189052 190809
1996 189167 192269 193993 194712 196210 196127 196229 196215 198843 200488 200200 201191
1997 202414 204273 204965 203372 201676 204666 207049 207643 208298 208064 208982 209379
# Gather the month columns
library(tidyr)
census2 <- gather(census, month, amount, -YEAR) 

# Arrange rows by YEAR using dplyr's arrange
census2_arr <- arrange(census2, YEAR)

# View first 20 rows of census2
census2_arr%>%
  head(12) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
YEAR month amount
1992 JAN 146913
1992 FEB 147270
1992 MAR 146831
1992 APR 148082
1992 MAY 149015
1992 JUN 149821
1992 JUL 150809
1992 AUG 151064
1992 SEP 152595
1992 OCT 153577
1992 NOV 153605
1992 DEC 155504

♀Variables are stored in both rows and columns

Sometimes you’ll run into situations where variables are stored in both rows and columns. To illustrate this, we’ve loaded the pets dataset from the video, which tells us in a convoluted way how many birds, cats, and dogs Jason, Lisa, and Terrence have. Print the pets dataset to see for yourself.

Although it may not be immediately obvious, if we treat the values in the type column as variables and create a separate column for each of them, we can set things straight. To do this, we use the spread() function. Run the following code to see for yourself:

spread(pets, type, num)

The result shows the exact same information in a much clearer way! Notice that the spread() function took in three arguments. The first argument takes the name of your messy dataset (pets), the second argument takes the name of the column to spread into new columns (type), and the third argument takes the column that contains the value with which to fill in the newly spread out columns (num).

Now let’s try this on a new messy dataset census_long. What information does this tell us?

# View first 50 rows of census_long
head(census_long, n = 50)

# Spread the type column
census_long2 <- spread(census_long, type, amount)

# View first 20 rows of census_long2
head(census_long2, n = 20)

Multiple values are stored in one column

It’s also fairly common that you will find two variables stored in a single column of data. These variables may be joined by a separator like a dash, underscore, space, or forward slash.

The separate() function comes in handy in these situations. To practice using it, we have created a slight modification of last exercise’s result. Keep in mind that the into argument, which specifies the names of the 2 new columns being formed, must be given as a character vector (e.g. c(“column1”, “column2”)).

# View the head of census_long3
head(census_long3)

# Separate the yr_month column into two
census_long4 <- separate(census_long3, yr_month, c("year", "month"))

# View the first 6 rows of the result
head(census_long4, n = 6)